Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

btl/uct: reduce number of messages sent when establishing connections #13018

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Jan 5, 2025

The btl/uct code can be quite aggressive at sends connection messages over the connection endpoint. This could lead to a large number of unnecessary messages in some cases. This commit adds code to restrict the retry rate to 2ms. This timing is controlled by a new MCA variable: btl_uct_connection_retry_timeout.

The btl/uct code can be quite aggressive at sends connection messages over the
connection endpoint. This could lead to a large number of unnecessary messages
in some cases. This commit adds code to restrict the retry rate to 2ms. This
timing is controlled by a new MCA variable: btl_uct_connection_retry_timeout.

Signed-off-by: Nathan Hjelm <[email protected]>
@hppritcha
Copy link
Member

is there a simple test which demonstrates the problem that this PR is addressing?

@hjelmn
Copy link
Member Author

hjelmn commented Jan 13, 2025

UD is a bit different on our hardware and the extra message don't break things per-se but I added logging of UD sends and, for a 384 (2 * 192 ppn) process run, there were over 250,000 UD messages in a mostly connected case when there should have been no more than 2 * 192^2 (each process should send one for each off-node connection). With this fix the number dropped to 20k because it was not fully connecting.

I may actually axe the retry entirely since UD and other similar TLs are treated as reliable by UCT. I have another change coming in to clean up the code a bit (leaving the retry) which has been fully tested but I need some time to test without the retry mechanism.

Copy link
Member

@hppritcha hppritcha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can't really test this but built for me and passed basic multi-node sanity using ob1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants